"Mapping the Meta" by Toby Murray Live captioning by Norma Miller. @whitecoatcapxg >> Yeah, thanks. So yeah, let's get meta with it. Oh, is this not working? >> Let me try that mic. >> Is this better? >> OK. So I prepared this presentation based on the weekly chain set dump file that's created along with the planet file on planet.osm.org, the data I'm using is as of July 4th and I stuffed it all into ha Postgres database with a script that I wrote that's out on GitHub and then I just started doing some queries and seeing what interesting data popped up. And there are many full-table scans to be done. So I'm glad SSD has been invented at this point in the history. So what is a changeset? If you're not aware of this, it's a nonatomic collection of related edits, which is kind of an academic phrase, but generally, it's an editing session, so every time you hit save, it creates a changeset, uploads the data to that changeset and closes it. You can add more stuff to it later, although the server will close it on you if you leave it open for more than one hour. They haven't always existed. It was introduced along with API version 0.6 and all the edits before then were just on their own so they created some changesets to take care of those. So 870,000 of those. So what information do we have on each changeset? So we've got the user information, so who made the edit, who made the edit, the time that they submitted the changes OpenStreetMap server and the number of objects that the changeset touched and like with every object in OSM, it has tags that can be generally free form. Generally we always try to set a comment and most editors set a created by tag to indicate which software was used to make the edit. So as best I can tell, this is the first real changeset that was made after the API0.6 was released, just adding some buildings on a university campus. Nothing special. So overall we have 40 million and a half changesets. 
There are 2 million empty changesets, and this is a combination of things: the old versions of Potlatch used to open a changeset as soon as you launched the editor, even if you didn't make any changes, but there was also some kind of script or something that went rogue in the middle of 2014 and created 500,000 empty changesets for no reason whatsoever until it was banned. So the total number of objects recorded is just over 6 billion. In theory, this should match the sum of the version numbers of all of the objects in the OpenStreetMap database, but I'm guessing it doesn't really match up.
Data consistency is hard over a 15-year project. When I started analyzing some of the spatial aspects of the changesets, I started getting errors from Postgres saying, hey, you've got invalid geometry. And I'm like, how can it be invalid geometry? It's a box. Well, it turns out if your latitude and longitude are 214, it is indeed invalid. How were these created? I have no idea. [laughter] Undoubtedly some bugs in early versions of the API. I think the last one of these I saw was in 2014, maybe. I don't remember off the top of my head.
So I mentioned one of the tags we like to set is a comment tag. It's really a way of interacting with the rest of the editing community, to explain your edit or highlight specific aspects of the edit that other users might find useful in the future. A lot of times I'll see that a business created an account just to add themselves to OpenStreetMap but messed up the tagging on their business. Like a lawyer just put their name and nothing else, but because they left a changeset comment that said "adding myself to the map" and the account name matches the lawyer's name, I can go back and fix it pretty easily, because I know what they were intending to do, even if they didn't do it right.
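The invalid bounding boxes mentioned earlier in this passage come down to a range check: longitude must lie in [-180, 180] and latitude in [-90, 90]. The sketch below is an assumption about how such rows could be screened before building geometries; the function name `bbox_is_valid` is hypothetical and is not the speaker's script.

```python
def bbox_is_valid(min_lon, min_lat, max_lon, max_lat):
    """Return True if the box lies within valid lon/lat ranges and is not inverted."""
    lon_ok = -180.0 <= min_lon <= max_lon <= 180.0
    lat_ok = -90.0 <= min_lat <= max_lat <= 90.0
    return lon_ok and lat_ok


# A small box in Kansas passes; a box at latitude/longitude 214 does not.
print(bbox_is_valid(-95.3, 38.9, -95.2, 39.0))
print(bbox_is_valid(214.0, 214.0, 214.0, 214.0))
```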
It can certainly help future mappers, too. If I look at something and think it's not quite right, I often take a look at the changeset in which the data was created or updated. So if someone comes through Kansas and says, you know, "changing all roads in Kansas to surface equals dirt, because Kansas is full of farms," [laughter] so that must be correct, if I see that in a changeset comment, I'm going to be like, no, that's ridiculous, I'm going to revert that. But if they say "updated roads in Kansas along my route through the state and added surface tags," then that's probably correct.
So I made a word cloud out of all of the changeset comments. Apparently all the cool kids in OpenStreetMap add roads from Bing. [laughter] Some things I noticed in this, though: the words "add," "added," that kind of stuff, are much more common than "updated" or "fixed." [laughter] I think this means we haven't finished mapping the world yet, so we are still adding data. At some point we may get to a steady state where we're updating more than we're adding. Probably a few years in the future. And you'll notice, just under the R from "road," there's a "bbox" word down there. That's actually an artifact from an older editor that's not used much any more, Merkaartor. It put the bounding box and the number of changes and something else automated in its changeset comments, and I'm like, well, that's kind of silly, because that's all recorded by the server anyway, so I don't know why they did that.
So this is for all the changesets since 2009. Here is just the year 2010, the first full year that changesets existed. You'll notice Bing is actually there, but it's very obscure, tiny. And you'll notice somewhere towards the top, it actually says Yahoo. Who remembers mapping from Yahoo imagery? [laughter] Must have a lot of new users here. The original Potlatch, Potlatch 1, defaulted to using Yahoo imagery back when I started.
You'll also notice the "bbox" word there is more prominent; the Merkaartor editor was more popular back then. And I notice there are a few more German words in the early days. So here is the same thing for the year 2015. Apparently HOT has had an impact on OpenStreetMap. I was actually rather surprised to see this, that Bing was reduced in prominence by the HOT OSM tag. And this doesn't take changeset size into account, like the number of objects. It's just the number of changesets, and a lot of HOT activities, you know, they encourage you to save often so you don't conflict with other people who are editing next to you, so that creates a lot of changesets.
Um, so this is a map conference. I haven't shown any maps yet. So let's get to a map. This is the bounding boxes for all of my changesets. In case you hadn't figured it out, I live in Kansas. [laughter] I have done some edits around the nation, county borders and that kind of stuff, so I get around.
So here is the average size of a changeset, and I was pretty surprised at how big it was. And then I realized that the average was a terrible statistic to use. [laughter] In fact, the median changeset size [laughter] is about 100 meters by 100 meters, and of course the large changesets that span the country or the continent or the globe skew the average heavily. So I decided to look at this some more and looked at all changesets less than 1 kilometer in size, and it turns out most of them are really tiny. So probably editing one or two points on the map. And to me, this is good. The more local our edits are, the more accurate I think they are. You know, if you're sitting at a restaurant and pull out your phone and add information while you're sitting there, it's probably correct.
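To illustrate why the average misleads here while the median does not, the sketch below uses a handful of made-up changeset widths in meters, with a single continent-spanning outlier. The numbers are invented for illustration and are not from the speaker's data.

```python
from statistics import mean, median

# Hypothetical bounding-box widths in meters: mostly small local edits,
# plus one changeset that spans a continent.
sizes_m = [20, 50, 80, 100, 120, 150, 4_000_000]

avg = mean(sizes_m)
med = median(sizes_m)

print(f"mean:   {avg:,.0f} m")
print(f"median: {med:,.0f} m")
```

One huge changeset pulls the mean to hundreds of kilometers, while the median stays at the size of a typical local edit.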
If you're trying to update all the restaurants in North America, you're probably going to make some mistakes, and there's a whole code of conduct for automated edits if you are trying to do this. So I want to see more of this, more local edits, more small edits.
So next I looked at edits by state, and this is just total changesets. I reduced the changesets to a centroid and eliminated ones that were larger than about a quarter the size of Kansas. And so it turns out Texas is huge. And New York and California have a lot of people. So it's not all that interesting of a map. Colorado stands out a little bit. Not sure why it had so many more changesets. But if you really want to leave an impact on the map, you can go to North Dakota. [laughter] So here I evened it out by population, and it's interesting to see that Texas and New York both lost heavily on this map. California did a little bit, but not as much as those two, for sure, and some of the less populated states jump out more on this map, South Dakota and Wyoming in particular. So that was kind of interesting to see.
As far as number of objects changed, the API has a limit of 50,000 objects. The API also had an off-by-one error, and so there are changesets with 50,001 objects. I believe that's been fixed in the meantime. Yeah, the average size is 156 objects per changeset. And the median again is substantially lower.
So let's look at some data sources. iD automatically adds an imagery_used tag that is determined by what imagery you have enabled while you make edits, and JOSM now strongly encourages adding a source tag. It used to be that you had to go out of your way to add it, but now they ask you to add it. So here's usage of the source and imagery_used tags over time, and this is a logarithmic scale because otherwise up through 2012 would have been nothing, but it does hide the fact that between 2012 and 2013, the usage of the source tag doubled from 4,000 to 120,000.
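The tag-usage-over-time count described above can be sketched as a per-year tally of changesets carrying a given tag key. The records below are invented, and the dict layout (`created_at`, `tags`) and the function name `tag_usage_by_year` are assumptions for illustration rather than the speaker's actual queries.

```python
from collections import Counter

# Invented changeset records; real ones would come from the dump or a database.
changesets = [
    {"created_at": "2012-03-01T00:00:00Z", "tags": {"source": "Bing"}},
    {"created_at": "2013-05-10T00:00:00Z", "tags": {"source": "survey"}},
    {"created_at": "2013-06-02T00:00:00Z",
     "tags": {"source": "local knowledge", "imagery_used": "Bing"}},
    {"created_at": "2013-07-15T00:00:00Z", "tags": {"comment": "fixed a road"}},
]


def tag_usage_by_year(records, key):
    """Count changesets per year that carry the given tag key."""
    counts = Counter()
    for cs in records:
        if key in cs["tags"]:
            year = int(cs["created_at"][:4])  # ISO timestamps start with the year
            counts[year] += 1
    return dict(sorted(counts.items()))


source_by_year = tag_usage_by_year(changesets, "source")
imagery_by_year = tag_usage_by_year(changesets, "imagery_used")
print(source_by_year)
print(imagery_by_year)
```

When counts differ by orders of magnitude between years, plotting them on a logarithmic axis keeps the early years visible, at the cost of flattening large jumps.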
Clearly Bing is a prominent source for us, but local knowledge is also pretty prominent. This word cloud generator just takes individual words, so "local" and "knowledge" usually go together in changeset comments that I've seen, and survey is also very prevalent, which is good to see. We like people getting out and actually looking at their map data. And you can see Mapillary up in the upper left, so people are starting to use that. Here is a similar thing for the imagery_used tag, and this is from iD only. The "local GPX" is what iD puts in if you load a GPX file from your computer.
-- I'm getting a stop sign here. Don't I have five more minutes? No? OK. I'm sitting at 15 minutes here on my timer. Anyway... I guess I'll just quickly cover this.
So have you ever been asked, how do you prevent vandalism in OpenStreetMap? It's always a question I get when I try to explain it to people. We get bad edits, bad imports, you name it. Well, we don't prevent it, we just revert it. So I looked at how many changesets there were with "revert" or "vandalism" in the comment, and there's quite a number, but at the end of the day, that 43 million, compared to the 6 billion, is a pretty small drop in the bucket. So we don't generally have a big problem with vandalism. Here's revert activity by year. The spike there in 2009 is a single user who reverted a bunch of small changesets. I'm not sure why.
And we have a pretty healthy suite of mobile editors. I mapped out their usage over time. That spike in January 2015 is again actually a single user who apparently created 7,000 object changesets in an iOS application. I have no idea how that's even possible, but -- [laughter]
So all right, I guess I will cut it off there. I have a few more slides, but I'll make them available somewhere. I have some stuff about hashtags, where HOT is very prominent, but I guess we will call it good there. [applause]
>> Any questions?
>> Do you have the link to the slides?
>> Do I have a link to the slides? No, not yet. I was still tinkering with them this morning, so yes, I'll tweet it out. You can see me on Twitter there. Yeah?
>> AUDIENCE MEMBER: Is there a median number of conflicts?
>> So you're talking about when you're trying to upload something and it conflicts with somebody else, or are you talking about edit wars?
>> I'm talking about when it conflicts with somebody else.
>> No, you can't really see that in this data, because this is just changesets after they've been uploaded and all the conflicts have been resolved.
>> So we don't have that data?
>> Yeah, we don't have that data. All right. [applause]